Skip to content

Conversation

@OutisLi
Copy link
Collaborator

@OutisLi OutisLi commented Nov 2, 2025

Description

This PR modifies the process_systems utility function to change how it handles list inputs.

Previously, if the systems argument was a str, the function would recursively search that path for systems. However, if systems was a list, the function would return the list as-is, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats every string path—whether it's a single str input or an item within a list—as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.

Motivation and Justification

The original implementation's inconsistent handling of str versus list inputs caused two significant problems:

  1. Broken JSON Configurations: A very common use case, specifying a single data directory in input.json like "systems": ["/path/to/training_data"], would fail. The function would not search inside /path/to/training_data for the actual system directories (e.g., set.000, set.001, etc.).
  2. Inability to Aggregate Data: It was impossible for users to combine multiple datasets by providing a list of top-level directories, such as "systems": ["/path/to/dataset_A", "/path/to/dataset_B"].

This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.

Benefits

  • Fixes Bug: Correctly processes the common configuration of a single-item list in input.json.
  • Enables Data Aggregation: Users can now successfully provide a list of multiple data directories to be searched and combined.
  • Improves Consistency: The function's behavior is now intuitive and consistent, regardless of whether the user provides a single str or a list[str].

Summary by CodeRabbit

  • Documentation

    • Clarified how training/validation system paths may be specified: a single system directory or a parent directory to recursively discover systems; lists of paths are explicitly supported and processed per-item.
  • Improvements

    • Input handling for system paths enhanced to accept multiple paths and consolidate discovered systems, with expanded pattern-based discovery for more flexible data input.

Copilot AI review requested due to automatic review settings November 2, 2025 04:40
@github-actions github-actions bot added the Python label Nov 2, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the process_systems function to handle lists of system directory paths by iterating over each path and applying expansion logic individually. Previously, when systems was a list, it was simply copied; now each item is processed through the same expansion logic as single string inputs.

Key changes:

  • Modified process_systems to iterate over list items and expand each path individually
  • Updated documentation for better clarity on how the function handles both strings and lists

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
deepmd/utils/data_system.py Refactored process_systems to expand each list item individually, improved docstring clarity, and added explicit type checking with error handling
deepmd/utils/argcheck.py Updated documentation strings to clarify behavior for string and list inputs for both training and validation data

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 2, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Updated documentation for training and validation data argument descriptions to clarify system path handling. Modified process_systems to accept either a single path string or a list of path strings, normalize to a list, expand each entry (via expand_sys_str or rglob_sys_str), and accumulate results into a consolidated list; invalid input types raise ValueError.

Changes

Cohort / File(s) Summary
Documentation updates for data arguments
deepmd/utils/argcheck.py
Revised docstrings for training_data_args and validation_data_args to state that a systems value may be a system directory path (containing type.raw) or a parent directory to recursively search for system subdirectories; when a list is provided, each string item is processed the same way.
Path processing enhancement
deepmd/utils/data_system.py
process_systems now accepts a single string or a list of strings, normalizes input to a list, iterates over each path applying expand_sys_str (when patterns is None) or rglob_sys_str (when patterns provided), extends a consolidated result_systems list with each expansion, and raises ValueError for unsupported input types.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant process_systems
    participant expand_sys_str
    participant rglob_sys_str
    rect `#e6f7ff`
    Caller->>process_systems: call with systems (str or list)
    end
    alt input is str
        process_systems->>process_systems: normalize -> [str]
    end
    loop for each path in list
        process_systems->>process_systems: decide pattern present?
        alt patterns is None
            process_systems->>expand_sys_str: expand path
            expand_sys_str-->>process_systems: expanded_paths
        else patterns present
            process_systems->>rglob_sys_str: rglob path with patterns
            rglob_sys_str-->>process_systems: matched_paths
        end
        process_systems->>process_systems: extend result_systems
    end
    process_systems-->>Caller: return consolidated result_systems (list)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Extra attention areas:
    • Edge cases: empty lists, duplicate paths, and mixed path types
    • Backward compatibility with callers expecting previous return shapes
    • Clarity and content of the ValueError raised for unsupported input types

Possibly related PRs

Suggested reviewers

  • njzjz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: Enhance process_systems to recursively search all paths in systems list' directly and specifically summarizes the main change: updating process_systems to recursively search each path in a systems list.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/utils/data_system.py (2)

808-817: Consider TypeError and add list item validation.

Two suggestions for improvement:

  1. Use TypeError for invalid types: When the input type is incorrect, TypeError is more semantically appropriate than ValueError (as suggested by static analysis).

  2. Validate list items are strings: Currently, if systems is a list containing non-string items (e.g., ["/path", None] or ["/path", 123]), the error will occur later in expand_sys_str or rglob_sys_str, producing a potentially confusing error message.

Apply this diff to improve type handling:

 # Normalize input to a list of paths to search
 if isinstance(systems, str):
     search_paths = [systems]
 elif isinstance(systems, list):
+    # Validate all items are strings
+    for idx, item in enumerate(systems):
+        if not isinstance(item, str):
+            raise TypeError(
+                f"All items in systems list must be str, but systems[{idx}] is {type(item).__name__}."
+            )
     search_paths = systems
 else:
     # Handle unsupported input types
-    raise ValueError(
-        f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+    raise TypeError(
+        f"systems must be str or list[str], got {type(systems).__name__}."
     )

819-829: Consider deduplicating the final result for edge cases.

The current implementation may return duplicate system paths if:

  • Multiple paths in the input list overlap (e.g., ["/data", "/data/subset1"])
  • The same path appears multiple times in the input list

While rglob_sys_str deduplicates within each path search, expand_sys_str does not, and duplicates across different search paths are not removed.

If deduplication is desired, you could modify the return statement:

     result_systems.extend(expanded_paths)

-return result_systems
+# Deduplicate while preserving order
+seen = set()
+deduplicated = []
+for system in result_systems:
+    if system not in seen:
+        seen.add(system)
+        deduplicated.append(system)
+return deduplicated

Alternatively, if order doesn't matter:

-return result_systems
+return list(dict.fromkeys(result_systems))  # Preserves order in Python 3.7+

Or simply:

-return result_systems
+return list(set(result_systems))  # May change order
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b970b81 and b0d858f.

📒 Files selected for processing (2)
  • deepmd/utils/argcheck.py (2 hunks)
  • deepmd/utils/data_system.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data_system.py
  • deepmd/utils/argcheck.py
🧬 Code graph analysis (1)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
  • expand_sys_str (191-208)
  • rglob_sys_str (211-232)
🪛 Ruff (0.14.2)
deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)


815-817: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (47)
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (3)
deepmd/utils/argcheck.py (2)

2994-2998: LGTM! Clear documentation of systems parameter behavior.

The updated documentation accurately describes the new behavior where both string and list inputs undergo recursive directory search, with each list item processed identically to a standalone string input.


3075-3079: LGTM! Consistent documentation across training and validation data.

The validation data documentation mirrors the training data documentation, maintaining consistency and clarity.

deepmd/utils/data_system.py (1)

787-807: LGTM! Updated docstring accurately reflects new behavior.

The docstring clearly describes the new uniform treatment of both string and list inputs, where each path is recursively searched for system directories.

@codecov
Copy link

codecov bot commented Nov 2, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.23%. Comparing base (7778e2e) to head (71921b9).
⚠️ Report is 1 commits behind head on devel.

Files with missing lines Patch % Lines
deepmd/utils/data_system.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5033      +/-   ##
==========================================
- Coverage   84.23%   84.23%   -0.01%     
==========================================
  Files         709      709              
  Lines       70075    70081       +6     
  Branches     3618     3619       +1     
==========================================
+ Hits        59027    59032       +5     
- Misses       9881     9883       +2     
+ Partials     1167     1166       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@iProzd iProzd requested a review from njzjz November 3, 2025 08:54
Copy link
Member

@njzjz njzjz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the documentation:

- {ref}`systems <training/training_data/systems>` provide paths of the training data systems. DeePMD-kit allows you to provide multiple systems with different numbers of atoms. This key can be a `list` or a `str`.
- `list`: {ref}`systems <training/training_data/systems>` gives the training data systems.
- `str`: {ref}`systems <training/training_data/systems>` should be a valid path. DeePMD-kit will recursively search all data systems in this path.

@github-actions github-actions bot added the Docs label Nov 7, 2025
@OutisLi OutisLi requested a review from njzjz November 7, 2025 07:10
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data_system.py (1)

808-817: Use TypeError for invalid type validation.

Static analysis correctly suggests using TypeError instead of ValueError when validating input types, as this is the more semantically appropriate exception for type-related errors.

Apply this diff:

     else:
-        # Handle unsupported input types
         raise ValueError(
-            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
         )
+        raise TypeError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )

Actually, let me fix that diff:

-    else:
-        # Handle unsupported input types
-        raise ValueError(
-            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
-        )
+    else:
+        raise TypeError(
+            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+        )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b0d858f and 71921b9.

📒 Files selected for processing (3)
  • deepmd/utils/argcheck.py (2 hunks)
  • deepmd/utils/data_system.py (2 hunks)
  • doc/train/training-advanced.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data_system.py
  • deepmd/utils/argcheck.py
🧬 Code graph analysis (1)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
  • expand_sys_str (191-208)
  • rglob_sys_str (211-232)
🪛 LanguageTool
doc/train/training-advanced.md

[uncategorized] ~79-~79: Loose punctuation mark.
Context: ...ey can be a list or a str. - str: {ref}`systems <training/training_data/s...

(UNLIKELY_OPENING_PUNCTUATION)


[uncategorized] ~80-~80: Loose punctuation mark.
Context: ...or all system subdirectories. - list: {ref}`systems <training/training_data/s...

(UNLIKELY_OPENING_PUNCTUATION)

🪛 Ruff (0.14.3)
deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)


815-817: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (16)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (5)
deepmd/utils/argcheck.py (2)

2996-2998: LGTM! Documentation accurately reflects the enhanced behavior.

The docstring clearly describes both input variants (string and list) and how each is processed. The documentation is consistent with the implementation changes in process_systems.


3078-3080: LGTM! Consistent documentation for validation data.

The docstring mirrors the training_data_args documentation and accurately describes the behavior for validation data systems.

doc/train/training-advanced.md (1)

79-80: LGTM! User documentation is clear and accurate.

The documentation clearly explains both input variants and emphasizes that list items are processed the same way as individual string inputs, which helps users understand the recursive search behavior.

deepmd/utils/data_system.py (2)

819-827: LGTM! Clear iteration logic with proper expansion.

The iteration over search_paths with appropriate expansion logic (either expand_sys_str or rglob_sys_str) correctly implements the enhanced functionality described in the PR objectives.


808-829: Address deduplication inconsistency: expand_sys_str paths are not deduplicated.

The review comment raises valid but partially addressed concerns:

  1. Empty list handling: An empty result_systems list is properly validated downstream. When process_systems returns an empty list, DeepmdDataSystem.__init__ (line 104-105) raises ValueError("No systems provided") with a clear error message. This is handled as designed.

  2. Deduplication: Inconsistency exists between code paths:

    • When patterns is not None: rglob_sys_str already deduplicates via list(set(matches)) at line 232 of deepmd/common.py
    • When patterns is None: expand_sys_str does not deduplicate; if multiple paths in search_paths yield overlapping directories, duplicates persist

For consistency and robustness, consider adding deduplication at the end of process_systems (line 829) to handle all scenarios uniformly, or ensure expand_sys_str deduplicates when called via process_systems.

@njzjz njzjz added this pull request to the merge queue Nov 7, 2025
Merged via the queue into deepmodeling:devel with commit 25fa707 Nov 7, 2025
60 checks passed
@OutisLi OutisLi deleted the pr/multi_system branch November 10, 2025 02:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants